{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<i>Copyright (c) Recommenders contributors.</i>\n",
                "\n",
                "<i>Licensed under the MIT License.</i>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# NAML: Neural News Recommendation with Attentive Multi-View Learning\n",
                "NAML \\[1\\] is a multi-view news recommendation approach. The core of NAML is a news encoder and a user encoder. The newsencoder is composed of a title encoder, a body encoder, a vert encoder and a subvert encoder. The CNN-based title encoder and body encoder learn title and body representations by capturing words semantic information. After getting news title, body, vert and subvert representations, an attention network is used to aggregate those vectors. In the user encoder, we learn representations of users from their browsed news. Besides, we apply additive attention to learn more informative news and user representations by selecting important words and news.\n",
                "\n",
                "## Properties of NAML:\n",
                "- NAML is a multi-view neural news recommendation approach.\n",
                "- It uses news title, news body, news vert and news subvert to get news repersentations. And it uses user historical behaviors to learn user representations.\n",
                "- NAML uses additive attention to learn informative news and user representations by selecting important words and news.\n",
                "- Due to some legal issue, MIND dataset does not release news body. Therefore, we use news abstract instead.\n",
                "\n",
                "## Data format:\n",
                "For quicker training and evaluaiton, we sample MINDdemo dataset of 5k users from [MIND small dataset](https://msnews.github.io/). The MINDdemo dataset has the same file format as MINDsmall and MINDlarge. If you want to try experiments on MINDsmall\n",
                " and MINDlarge, please change the dowload source.\n",
                " Select the MIND_type parameter from ['large', 'small', 'demo'] to choose dataset.\n",
                " \n",
                "**MINDdemo_train** is used for training, and **MINDdemo_dev** is used for evaluation. Training data and evaluation data are composed of a news file and a behaviors file. You can find more detailed data description in [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md)\n",
                "\n",
                "### news data\n",
                "This file contains news information including newsid, category, subcatgory, news title, news abstarct, news url and entities in news title, entities in news abstarct.\n",
                "One simple example: <br>\n",
                "\n",
                "`N46466\tlifestyle\tlifestyleroyals\tThe Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By\tShop the notebooks, jackets, and more that the royals can't live without.\thttps://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata\t[{\"Label\": \"Prince Philip, Duke of Edinburgh\", \"Type\": \"P\", \"WikidataId\": \"Q80976\", \"Confidence\": 1.0, \"OccurrenceOffsets\": [48], \"SurfaceForms\": [\"Prince Philip\"]}, {\"Label\": \"Charles, Prince of Wales\", \"Type\": \"P\", \"WikidataId\": \"Q43274\", \"Confidence\": 1.0, \"OccurrenceOffsets\": [28], \"SurfaceForms\": [\"Prince Charles\"]}, {\"Label\": \"Elizabeth II\", \"Type\": \"P\", \"WikidataId\": \"Q9682\", \"Confidence\": 0.97, \"OccurrenceOffsets\": [11], \"SurfaceForms\": [\"Queen Elizabeth\"]}]\t[]`\n",
                "<br>\n",
                "\n",
                "In general, each line in data file represents information of one piece of news: <br>\n",
                "\n",
                "`[News ID] [Category] [Subcategory] [News Title] [News Abstrct] [News Url] [Entities in News Title] [Entities in News Abstract] ...`\n",
                "\n",
                "<br>\n",
                "\n",
                "We generate a word_dict file to tranform words in news title and news abstract to word indexes, and a embedding matrix is initted from pretrained glove embeddings.\n",
                "\n",
                "### behaviors data\n",
                "One simple example: <br>\n",
                "`1\tU82271\t11/11/2019 3:28:58 PM\tN3130 N11621 N12917 N4574 N12140 N9748\tN13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0`\n",
                "<br>\n",
                "\n",
                "In general, each line in data file represents one instance of an impression. The format is like: <br>\n",
                "\n",
                "`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`\n",
                "\n",
                "<br>\n",
                "\n",
                "User Click History is the user historical clicked news before Impression Time. Impression News is the displayed news in an impression, which format is:<br>\n",
                "\n",
                "`[News ID 1]-[label1] ... [News ID n]-[labeln]`\n",
                "\n",
                "<br>\n",
                "Label represents whether the news is clicked by the user. All information of news in User Click History and Impression News can be found in news data file."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Global settings and imports"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "/anaconda/envs/tf2/lib/python3.7/site-packages/papermill/iorw.py:50: FutureWarning: pyarrow.HadoopFileSystem is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.\n",
                        "  from pyarrow import HadoopFileSystem\n"
                    ]
                },
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "System version: 3.7.11 (default, Jul 27 2021, 14:32:16) \n",
                        "[GCC 7.5.0]\n",
                        "Tensorflow version: 2.6.1\n"
                    ]
                }
            ],
            "source": [
                "import os\n",
                "import sys\n",
                "import numpy as np\n",
                "import zipfile\n",
                "from tqdm import tqdm\n",
                "from tempfile import TemporaryDirectory\n",
                "import tensorflow as tf\n",
                "tf.get_logger().setLevel('ERROR') # only show error messages\n",
                "\n",
                "from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources \n",
                "from recommenders.models.newsrec.newsrec_utils import prepare_hparams\n",
                "from recommenders.models.newsrec.models.naml import NAMLModel\n",
                "from recommenders.models.newsrec.io.mind_all_iterator import MINDAllIterator\n",
                "from recommenders.models.newsrec.newsrec_utils import get_mind_data_set\n",
                "from recommenders.utils.notebook_utils import store_metadata\n",
                "\n",
                "print(\"System version: {}\".format(sys.version))\n",
                "print(\"Tensorflow version: {}\".format(tf.__version__))\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Prepare Parameters"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {
                "tags": [
                    "parameters"
                ]
            },
            "outputs": [],
            "source": [
                "epochs = 5\n",
                "seed = 42\n",
                "batch_size = 32\n",
                "\n",
                "# Options: demo, small, large\n",
                "MIND_type = 'demo'"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Download and load data"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "100%|██████████| 17.0k/17.0k [00:01<00:00, 10.0kKB/s]\n",
                        "100%|██████████| 9.84k/9.84k [00:01<00:00, 8.42kKB/s]\n",
                        "100%|██████████| 95.0k/95.0k [00:06<00:00, 14.5kKB/s]\n"
                    ]
                }
            ],
            "source": [
                "tmpdir = TemporaryDirectory()\n",
                "data_path = tmpdir.name\n",
                "\n",
                "train_news_file = os.path.join(data_path, 'train', r'news.tsv')\n",
                "train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')\n",
                "valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')\n",
                "valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')\n",
                "wordEmb_file = os.path.join(data_path, \"utils\", \"embedding_all.npy\")\n",
                "userDict_file = os.path.join(data_path, \"utils\", \"uid2index.pkl\")\n",
                "wordDict_file = os.path.join(data_path, \"utils\", \"word_dict_all.pkl\")\n",
                "vertDict_file = os.path.join(data_path, \"utils\", \"vert_dict.pkl\")\n",
                "subvertDict_file = os.path.join(data_path, \"utils\", \"subvert_dict.pkl\")\n",
                "yaml_file = os.path.join(data_path, \"utils\", r'naml.yaml')\n",
                "\n",
                "mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)\n",
                "\n",
                "if not os.path.exists(train_news_file):\n",
                "    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)\n",
                "    \n",
                "if not os.path.exists(valid_news_file):\n",
                "    download_deeprec_resources(mind_url, \\\n",
                "                               os.path.join(data_path, 'valid'), mind_dev_dataset)\n",
                "if not os.path.exists(yaml_file):\n",
                "    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/newsrec/', \\\n",
                "                               os.path.join(data_path, 'utils'), mind_utils)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Create hyper-parameters"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 5,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "data_format=naml,iterator_type=None,support_quick_scoring=True,wordEmb_file=/tmp/tmp1_pb4727/utils/embedding_all.npy,wordDict_file=/tmp/tmp1_pb4727/utils/word_dict_all.pkl,userDict_file=/tmp/tmp1_pb4727/utils/uid2index.pkl,vertDict_file=/tmp/tmp1_pb4727/utils/vert_dict.pkl,subvertDict_file=/tmp/tmp1_pb4727/utils/subvert_dict.pkl,title_size=30,body_size=50,word_emb_dim=300,word_size=None,user_num=None,vert_num=17,subvert_num=249,his_size=50,npratio=4,dropout=0.2,attention_hidden_dim=200,head_num=4,head_dim=100,cnn_activation=relu,dense_activation=relu,filter_num=400,window_size=3,vert_emb_dim=100,subvert_emb_dim=100,gru_unit=400,type=ini,user_emb_dim=50,learning_rate=0.0001,loss=cross_entropy_loss,optimizer=adam,epochs=5,batch_size=32,show_step=100000,metrics=['group_auc', 'mean_mrr', 'ndcg@5;10']\n"
                    ]
                }
            ],
            "source": [
                "hparams = prepare_hparams(yaml_file, \n",
                "                          wordEmb_file=wordEmb_file,\n",
                "                          wordDict_file=wordDict_file, \n",
                "                          userDict_file=userDict_file,\n",
                "                          vertDict_file=vertDict_file, \n",
                "                          subvertDict_file=subvertDict_file,\n",
                "                          batch_size=batch_size,\n",
                "                          epochs=epochs)\n",
                "print(hparams)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 6,
            "metadata": {},
            "outputs": [],
            "source": [
                "iterator = MINDAllIterator"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Train the NAML model"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [],
            "source": [
                "model = NAMLModel(hparams, iterator, seed=seed)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "1085it [01:49,  9.91it/s]\n",
                        "18693it [01:16, 242.85it/s]\n",
                        "7507it [00:30, 244.22it/s]\n",
                        "7538it [00:01, 6540.79it/s]\n",
                        "2it [00:00, 10.26it/s]"
                    ]
                },
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "at epoch 1\n",
                        "train info: logloss loss:1.4915421532046411\n",
                        "eval info: group_auc:0.5789, mean_mrr:0.2473, ndcg@10:0.3371, ndcg@5:0.2699\n",
                        "at epoch 1 , train time: 109.5 eval time: 116.2\n"
                    ]
                },
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "1085it [01:43, 10.48it/s]\n",
                        "18693it [01:16, 243.31it/s]\n",
                        "7507it [00:30, 247.71it/s]\n",
                        "7538it [00:01, 6653.75it/s]\n",
                        "1it [00:00,  9.91it/s]"
                    ]
                },
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "at epoch 2\n",
                        "train info: logloss loss:1.4155106401663222\n",
                        "eval info: group_auc:0.607, mean_mrr:0.2659, ndcg@10:0.3586, ndcg@5:0.2934\n",
                        "at epoch 2 , train time: 103.5 eval time: 115.7\n"
                    ]
                },
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "1085it [01:43, 10.45it/s]\n",
                        "18693it [01:16, 244.17it/s]\n",
                        "7507it [00:30, 246.78it/s]\n",
                        "7538it [00:01, 5570.62it/s]\n",
                        "1it [00:00,  9.92it/s]"
                    ]
                },
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "at epoch 3\n",
                        "train info: logloss loss:1.3669039504319291\n",
                        "eval info: group_auc:0.6199, mean_mrr:0.2719, ndcg@10:0.3675, ndcg@5:0.3038\n",
                        "at epoch 3 , train time: 103.8 eval time: 115.7\n"
                    ]
                },
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "1085it [01:43, 10.46it/s]\n",
                        "18693it [01:16, 244.82it/s]\n",
                        "7507it [00:30, 245.09it/s]\n",
                        "7538it [00:00, 8483.47it/s]\n",
                        "1it [00:00,  9.64it/s]"
                    ]
                },
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "at epoch 4\n",
                        "train info: logloss loss:1.3437963971344558\n",
                        "eval info: group_auc:0.6329, mean_mrr:0.2883, ndcg@10:0.3828, ndcg@5:0.3193\n",
                        "at epoch 4 , train time: 103.7 eval time: 115.3\n"
                    ]
                },
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "1085it [01:43, 10.46it/s]\n",
                        "18693it [01:16, 244.65it/s]\n",
                        "7507it [00:30, 247.97it/s]\n",
                        "7538it [00:01, 6144.20it/s]\n"
                    ]
                },
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "at epoch 5\n",
                        "train info: logloss loss:1.3198621605398468\n",
                        "eval info: group_auc:0.6272, mean_mrr:0.2846, ndcg@10:0.376, ndcg@5:0.31\n",
                        "at epoch 5 , train time: 103.8 eval time: 115.3\n",
                        "CPU times: user 30min 1s, sys: 2min 34s, total: 32min 35s\n",
                        "Wall time: 18min 22s\n"
                    ]
                },
                {
                    "data": {
                        "text/plain": [
                            "<recommenders.models.newsrec.models.naml.NAMLModel at 0x7f13946c2dd8>"
                        ]
                    },
                    "execution_count": 8,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "%%time\n",
                "model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 9,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "18693it [01:16, 245.66it/s]\n",
                        "7507it [00:30, 249.77it/s]\n",
                        "7538it [00:01, 6922.17it/s]\n"
                    ]
                },
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "{'group_auc': 0.6272, 'mean_mrr': 0.2846, 'ndcg@5': 0.31, 'ndcg@10': 0.376}\n",
                        "CPU times: user 4min 2s, sys: 25.8 s, total: 4min 28s\n",
                        "Wall time: 1min 54s\n"
                    ]
                }
            ],
            "source": [
                "%%time\n",
                "res_syn = model.run_eval(valid_news_file, valid_behaviors_file)\n",
                "print(res_syn)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Record results for tests - ignore this cell\n",
                "store_metadata(\"group_auc\", res_syn['group_auc'])\n",
                "store_metadata(\"mean_mrr\", res_syn['mean_mrr'])\n",
                "store_metadata(\"ndcg@5\", res_syn['ndcg@5'])\n",
                "store_metadata(\"ndcg@10\", res_syn['ndcg@10'])"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Save the model"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 11,
            "metadata": {},
            "outputs": [],
            "source": [
                "model_path = os.path.join(data_path, \"model\")\n",
                "os.makedirs(model_path, exist_ok=True)\n",
                "\n",
                "model.model.save_weights(os.path.join(model_path, \"naml_ckpt\"))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Output Prediction File\n",
                "This code segment is used to generate the prediction.zip file, which is in the same format in [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).\n",
                "\n",
                "Please change the `MIND_type` parameter to `large` if you want to submit your prediction to [MIND Competition](https://msnews.github.io/competition.html)."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 12,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "18693it [01:16, 244.49it/s]\n",
                        "7507it [00:30, 249.89it/s]\n",
                        "7538it [00:01, 7159.06it/s]\n"
                    ]
                }
            ],
            "source": [
                "group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 13,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "7538it [00:00, 40033.61it/s]\n"
                    ]
                }
            ],
            "source": [
                "with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:\n",
                "    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):\n",
                "        impr_index += 1\n",
                "        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()\n",
                "        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'\n",
                "        f.write(' '.join([str(impr_index), pred_rank])+ '\\n')"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 14,
            "metadata": {},
            "outputs": [],
            "source": [
                "f = zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED)\n",
                "f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')\n",
                "f.close()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Reference\n",
                "\\[1\\] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang and Xing Xie: Neural News Recommendation with Attentive Multi-View Learning, IJCAI 2019<br>\n",
                "\\[2\\] Wu, Fangzhao, et al. \"MIND: A Large-scale Dataset for News Recommendation\" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>\n",
                "\\[3\\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/"
            ]
        }
    ],
    "metadata": {
        "celltoolbar": "Tags",
        "interpreter": {
            "hash": "3a9a0c422ff9f08d62211b9648017c63b0a26d2c935edc37ebb8453675d13bb5"
        },
        "kernelspec": {
            "display_name": "Python 3.7.11 64-bit ('tf2': conda)",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.7.11"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}
